Add extra RL files #2077
Merged
ArEsKay3 (Contributor) approved these changes on Oct 31, 2025 and left a comment:
👍
Contributor (Author):
/ok to test

@tdene, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor (Author):
/ok to test 0c14bc6
Jianbing-D pushed a commit to Jianbing-D/Megatron-LM that referenced this pull request on Nov 12, 2025
* ci: Move test optimizer into its own bucket (NVIDIA#1909) Signed-off-by: oliver könig <[email protected]> * ci: Use matrix for approval-bot Signed-off-by: oliver könig <[email protected]> * ci: Update function name Signed-off-by: oliver könig <[email protected]> * ci: Adjust approval-bot for copy-pr-bot Signed-off-by: oliver könig <[email protected]> * ci: Parametrize workflow Signed-off-by: oliver könig <[email protected]> * ci: Parametrize workflow Signed-off-by: oliver könig <[email protected]> * ci: Remove attribute Signed-off-by: oliver könig <[email protected]> * ci: Update container image tag to use GitHub SHA * chore: Remove file * ci: Fix approval bot Signed-off-by: oliver könig <[email protected]> * ci: Configure cherrypick bot (NVIDIA#1925) Signed-off-by: oliver könig <[email protected]> * Ci approve dev (NVIDIA#1933) Signed-off-by: oliver könig <[email protected]> * ci: Update nightly schedule (NVIDIA#1934) Signed-off-by: oliver könig <[email protected]> * ci: Bump pre-flight for runs on main/dev (NVIDIA#1935) Signed-off-by: oliver könig <[email protected]> * ci: Allow skipping on main (NVIDIA#1936) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/pr template community bot (NVIDIA#1937) * ci: More granular unit tests buckets (NVIDIA#1932) Signed-off-by: oliver könig <[email protected]> * Add sequence packing to RL (NVIDIA#1911) Add sequence packing to RL * chore: Update template (NVIDIA#1939) Signed-off-by: oliver könig <[email protected]> * chore: Add description about who can merge (NVIDIA#1940) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/fix main on eos (NVIDIA#1938) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/internal mrs (NVIDIA#1942) Signed-off-by: oliver könig <[email protected]> * ci: Fix branch of approval bot (NVIDIA#1944) Signed-off-by: oliver könig <[email protected]> * ci: Approvalbot for other branches (NVIDIA#1947) Signed-off-by: oliver könig <[email protected]> * ci(fix): Approval bot 
(NVIDIA#1949) Signed-off-by: oliver könig <[email protected]> * ci(fix): Approval gate Signed-off-by: oliver könig <[email protected]> * ci: Approval gate rule Signed-off-by: oliver könig <[email protected]> * ci: Update golden values nightly Signed-off-by: oliver könig <[email protected]> * ci: Approval gate Signed-off-by: oliver könig <[email protected]> * ci: Approval bot Signed-off-by: oliver könig <[email protected]> * ci: Sync branches Signed-off-by: oliver könig <[email protected]> * ci: Smaller image Signed-off-by: oliver könig <[email protected]> * ci: Better output Signed-off-by: oliver könig <[email protected]> * ci: sync branches Signed-off-by: oliver könig <[email protected]> * ci: Fix sync bot Signed-off-by: oliver könig <[email protected]> * ci: Finalize Signed-off-by: oliver könig <[email protected]> * ci: Finalize Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/sync branches (NVIDIA#1956) Signed-off-by: oliver könig <[email protected]> * ci: Increase time limit for main tests Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/add milestone (NVIDIA#1951) Signed-off-by: oliver könig <[email protected]> * Remove M-FSDP testing under LTS environment (NVIDIA#1959) * ci: Run on push to release branch (NVIDIA#1960) Signed-off-by: oliver könig <[email protected]> * ci: Add golden values for inference Signed-off-by: oliver könig <[email protected]> * Fix typo in rl section of CODEOWNERS (NVIDIA#1968) * ci: Update copyright checker (NVIDIA#1973) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/auto reminder GitHub (NVIDIA#1955) Signed-off-by: oliver könig <[email protected]> * ci: Update secret Signed-off-by: oliver könig <[email protected]> * ci(fix): `Run tests` label (NVIDIA#1970) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Disable tests again Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Add merge-group to copyright check Signed-off-by: oliver könig <[email protected]> * ci(hotfix): 
Copyright check on merge-queue Signed-off-by: oliver könig <[email protected]> * zarr soft deprecation (NVIDIA#2004) Signed-off-by: dimapihtar <[email protected]> Co-authored-by: oliver könig <[email protected]> * Make `get_asyncio_loop` safe to use repeatedly (NVIDIA#1990) Co-authored-by: oliver könig <[email protected]> * Update symmetric registration interface to sync-up with upstream pytorch change (NVIDIA#1924) Signed-off-by: Youngeun Kwon <[email protected]> Signed-off-by: Youngeun <[email protected]> Co-authored-by: oliver könig <[email protected]> * chore: Update codeowners (NVIDIA#2012) Signed-off-by: oliver könig <[email protected]> * Deduplicate dynamic engine + coordinator. (NVIDIA#1981) Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * Safely access state dict args in load ckpt (NVIDIA#1957) Signed-off-by: Maanu Grover <[email protected]> * Allow mixed-batch sampling in dynamic inference (NVIDIA#1927) * Stop Nemo_CICD_Test from failing in forks (NVIDIA#2024) * Clean up dynamic inference step (NVIDIA#1992) Co-authored-by: Lawrence McAfee <[email protected]> Co-authored-by: oliver könig <[email protected]> * ci: Auto-update copy-pr-bot vetters (NVIDIA#1850) Signed-off-by: oliver könig <[email protected]> Co-authored-by: AJ Schmidt <[email protected]> * Have datasets account for tokenizers which incorrectly define PAD (NVIDIA#2017) * ci: Enable integration tests (NVIDIA#2023) Signed-off-by: oliver könig <[email protected]> * ci: Fix build-push-wheel workflow (NVIDIA#2022) Signed-off-by: oliver könig <[email protected]> * chore: Update tooling for interactive jobs (NVIDIA#2032) Signed-off-by: oliver könig <[email protected]> * revert(hotfix): ci: trustees_override (NVIDIA#2041) Signed-off-by: oliver könig <[email protected]> * add missing warnings import in model parallel config (NVIDIA#2039) Signed-off-by: ykarnati <[email protected]> * Reduce-scatter implementation with FP32 accumulation (NVIDIA#1967) 
Signed-off-by: Deepak Narayanan <[email protected]> * ci(fix): Workflows on `main` (NVIDIA#2045) Signed-off-by: oliver könig <[email protected]> * build: Bump modelopt (NVIDIA#2046) Signed-off-by: oliver könig <[email protected]> * Remove TestCaptureFreezeGC unit test. (NVIDIA#1978) * ci: Add multi-approval action (NVIDIA#2051) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Repair codeowners file * ci(hotfix): Set docs allowed to fail Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/test iteration time (NVIDIA#2067) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Remove performance for ckpt-resume Signed-off-by: oliver könig <[email protected]> * Allow inference test throughput to vary by 10% (NVIDIA#2070) * ci(hotfix): Inference test pipeline Signed-off-by: oliver könig <[email protected]> * chore: Fix autoformatter (NVIDIA#2073) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Remove iteration-time from t5 Signed-off-by: oliver könig <[email protected]> * ci(hotfix): disable inference test Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Disable inference test Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Bypass approvalbot in merge-queue (NVIDIA#2082) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Enable merge-group for approval bot Signed-off-by: oliver könig <[email protected]> * chore: Update local tooling (NVIDIA#2066) Signed-off-by: oliver könig <[email protected]> * Add extra RL files (NVIDIA#2077) Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: oliver könig <[email protected]> * Prevent summary jobs from running in forks (NVIDIA#2083) Co-authored-by: oliver könig <[email protected]> * ci: Fix test scope (NVIDIA#2091) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Remove publish workflows Signed-off-by: oliver könig <[email protected]> * Refactor the attention metadata into separate classes (NVIDIA#2001) Co-authored-by: Siddharth 
Singh <[email protected]> Co-authored-by: oliver könig <[email protected]> * Guard against incorrectly using MoE prefill graphs (NVIDIA#2030) Co-authored-by: oliver könig <[email protected]> * Revert "Refactor the attention metadata into separate classes (NVIDIA#2001)" This reverts commit a652e2c. * Run mr-slim tests in lightweight-mode (NVIDIA#2106) Signed-off-by: Charlie Truong <[email protected]> * Inference | Lazy compile UVM allocator. (NVIDIA#1977) Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * chore: Reenable trustees (NVIDIA#2108) Signed-off-by: oliver könig <[email protected]> * Revert "Inference | Lazy compile UVM allocator. (NVIDIA#1977)" This reverts commit 7487c53. * ci(fix): Changeset of copyright checker (NVIDIA#2110) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/chore/update release settings (NVIDIA#2097) Signed-off-by: oliver könig <[email protected]> * Remove unnecessary check on rotary_pos_cos (NVIDIA#2003) Signed-off-by: Keshav Santhanam <[email protected]> * (Reverted) Inference | Lazy compile UVM allocator. 
(NVIDIA#2125) Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * Refactor Attention Metadata to Separate Classes (NVIDIA#2112) Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: oliver könig <[email protected]> * Refactor model_provider to model_builder format for ModelOpt examples (NVIDIA#2107) * wandb Inference stats logging (NVIDIA#2026) Co-authored-by: root <[email protected]> Co-authored-by: William Dykas <[email protected]> Co-authored-by: root <[email protected]> * Make `PipelineParallelLayout` always return str from ` __repr__` (NVIDIA#2055) Signed-off-by: Ananth Subramaniam <[email protected]> Co-authored-by: oliver könig <[email protected]> * Add flash_attn_3 as first option for FA3 import (NVIDIA#2010) Signed-off-by: Keshav Santhanam <[email protected]> Co-authored-by: oliver könig <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Add debugging hint for case when cudagraphs are created but no matching runner is found (NVIDIA#2129) * ci: LTS container (NVIDIA#2133) Signed-off-by: oliver könig <[email protected]> * Revert "ci: LTS container (NVIDIA#2133)" This reverts commit eb48e81. 
* Fix param init (NVIDIA#2033) Signed-off-by: Chen Cui <[email protected]> * Hotfix to unit tests on hopper FA3 (NVIDIA#2143) * Add BytesIO to safe_globals (NVIDIA#2074) * add deprecation warning for legacy tokenizer system (NVIDIA#2145) Signed-off-by: dimapihtar <[email protected]> * replay: ci: Bump LTS container (NVIDIA#2157) Signed-off-by: oliver könig <[email protected]> * Hotfix to unit tests on hopper FA3 (bis) (NVIDIA#2179) * Fix has_modelopt_state() for native Torch checkpoint format (NVIDIA#2160) Signed-off-by: Asha Anoosheh <[email protected]> * chore: Remove codeowners (NVIDIA#2175) Signed-off-by: oliver könig <[email protected]> * Fix FP8 inference with sequence parallelism (NVIDIA#2009) Signed-off-by: Keshav Santhanam <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Replace ModelOpt generation server (NVIDIA#2147) Signed-off-by: Asha Anoosheh <[email protected]> * Add hybrid model support for dynamic inference engine (NVIDIA#1907) Signed-off-by: Keshav Santhanam <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Async task and event loop safety in Megatron Core (NVIDIA#2025) Co-authored-by: Robert Kirby <[email protected]> * Rename skip_prompt_log_probs (NVIDIA#2181) * Dynamic inference context | UVM only. (NVIDIA#1983) Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Update copy-pr-bot.yaml [skip ci] Signed-off-by: oliver könig <[email protected]> * Revert "Dynamic inference context | UVM only. (NVIDIA#1983)" This reverts commit d6979d6. 
* ci: Run `auto-update-copy-pr-bot` only on forks (NVIDIA#2191) Signed-off-by: oliver könig <[email protected]> * Inference throughput tests: refactor goldens to be in list format (NVIDIA#2072) * Enable TE custom quantization recipe (NVIDIA#2005) Signed-off-by: Evgeny <[email protected]> Signed-off-by: root <Evgeny> Co-authored-by: oliver könig <[email protected]> Co-authored-by: root <Evgeny> * Remove redundant logits calculations in gpt_model --------- Signed-off-by: oliver könig <[email protected]> Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Youngeun Kwon <[email protected]> Signed-off-by: Youngeun <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: ykarnati <[email protected]> Signed-off-by: Deepak Narayanan <[email protected]> Signed-off-by: Charlie Truong <[email protected]> Signed-off-by: Keshav Santhanam <[email protected]> Signed-off-by: Ananth Subramaniam <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Asha Anoosheh <[email protected]> Signed-off-by: Evgeny <[email protected]> Signed-off-by: root <Evgeny> Co-authored-by: oliver könig <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: Youngeun Kwon <[email protected]> Co-authored-by: Lawrence McAfee <[email protected]> Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: Lawrence McAfee <[email protected]> Co-authored-by: AJ Schmidt <[email protected]> Co-authored-by: Yashaswi Karnati <[email protected]> Co-authored-by: Deepak Narayanan <[email protected]> Co-authored-by: helen ngo <[email protected]> Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: kanz-nv <[email protected]> Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: Charlie Truong <[email protected]> Co-authored-by: Keshav Santhanam <[email protected]> Co-authored-by: Asha Anoosheh <[email 
protected]> Co-authored-by: wdykas <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: William Dykas <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Ananth Subramaniam <[email protected]> Co-authored-by: Chen Cui <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: Evgeny Tsykunov <[email protected]>
Jianbing-D pushed a commit to Jianbing-D/Megatron-LM that referenced this pull request on Nov 12, 2025
* ci: Move test optimizer into its own bucket (NVIDIA#1909) Signed-off-by: oliver könig <[email protected]> * ci: Use matrix for approval-bot Signed-off-by: oliver könig <[email protected]> * ci: Update function name Signed-off-by: oliver könig <[email protected]> * ci: Adjust approval-bot for copy-pr-bot Signed-off-by: oliver könig <[email protected]> * ci: Parametrize workflow Signed-off-by: oliver könig <[email protected]> * ci: Parametrize workflow Signed-off-by: oliver könig <[email protected]> * ci: Remove attribute Signed-off-by: oliver könig <[email protected]> * ci: Update container image tag to use GitHub SHA * chore: Remove file * ci: Fix approval bot Signed-off-by: oliver könig <[email protected]> * ci: Configure cherrypick bot (NVIDIA#1925) Signed-off-by: oliver könig <[email protected]> * Ci approve dev (NVIDIA#1933) Signed-off-by: oliver könig <[email protected]> * ci: Update nightly schedule (NVIDIA#1934) Signed-off-by: oliver könig <[email protected]> * ci: Bump pre-flight for runs on main/dev (NVIDIA#1935) Signed-off-by: oliver könig <[email protected]> * ci: Allow skipping on main (NVIDIA#1936) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/pr template community bot (NVIDIA#1937) * ci: More granular unit tests buckets (NVIDIA#1932) Signed-off-by: oliver könig <[email protected]> * Add sequence packing to RL (NVIDIA#1911) Add sequence packing to RL * chore: Update template (NVIDIA#1939) Signed-off-by: oliver könig <[email protected]> * chore: Add description about who can merge (NVIDIA#1940) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/fix main on eos (NVIDIA#1938) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/internal mrs (NVIDIA#1942) Signed-off-by: oliver könig <[email protected]> * ci: Fix branch of approval bot (NVIDIA#1944) Signed-off-by: oliver könig <[email protected]> * ci: Approvalbot for other branches (NVIDIA#1947) Signed-off-by: oliver könig <[email protected]> * ci(fix): Approval bot 
(NVIDIA#1949) Signed-off-by: oliver könig <[email protected]> * ci(fix): Approval gate Signed-off-by: oliver könig <[email protected]> * ci: Approval gate rule Signed-off-by: oliver könig <[email protected]> * ci: Update golden values nightly Signed-off-by: oliver könig <[email protected]> * ci: Approval gate Signed-off-by: oliver könig <[email protected]> * ci: Approval bot Signed-off-by: oliver könig <[email protected]> * ci: Sync branches Signed-off-by: oliver könig <[email protected]> * ci: Smaller image Signed-off-by: oliver könig <[email protected]> * ci: Better output Signed-off-by: oliver könig <[email protected]> * ci: sync branches Signed-off-by: oliver könig <[email protected]> * ci: Fix sync bot Signed-off-by: oliver könig <[email protected]> * ci: Finalize Signed-off-by: oliver könig <[email protected]> * ci: Finalize Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/sync branches (NVIDIA#1956) Signed-off-by: oliver könig <[email protected]> * ci: Increase time limit for main tests Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/add milestone (NVIDIA#1951) Signed-off-by: oliver könig <[email protected]> * Remove M-FSDP testing under LTS environment (NVIDIA#1959) * ci: Run on push to release branch (NVIDIA#1960) Signed-off-by: oliver könig <[email protected]> * ci: Add golden values for inference Signed-off-by: oliver könig <[email protected]> * Fix typo in rl section of CODEOWNERS (NVIDIA#1968) * ci: Update copyright checker (NVIDIA#1973) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/auto reminder GitHub (NVIDIA#1955) Signed-off-by: oliver könig <[email protected]> * ci: Update secret Signed-off-by: oliver könig <[email protected]> * ci(fix): `Run tests` label (NVIDIA#1970) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Disable tests again Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Add merge-group to copyright check Signed-off-by: oliver könig <[email protected]> * ci(hotfix): 
Copyright check on merge-queue Signed-off-by: oliver könig <[email protected]> * zarr soft deprecation (NVIDIA#2004) Signed-off-by: dimapihtar <[email protected]> Co-authored-by: oliver könig <[email protected]> * Make `get_asyncio_loop` safe to use repeatedly (NVIDIA#1990) Co-authored-by: oliver könig <[email protected]> * Update symmetric registration interface to sync-up with upstream pytorch change (NVIDIA#1924) Signed-off-by: Youngeun Kwon <[email protected]> Signed-off-by: Youngeun <[email protected]> Co-authored-by: oliver könig <[email protected]> * chore: Update codeowners (NVIDIA#2012) Signed-off-by: oliver könig <[email protected]> * Deduplicate dynamic engine + coordinator. (NVIDIA#1981) Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * Safely access state dict args in load ckpt (NVIDIA#1957) Signed-off-by: Maanu Grover <[email protected]> * Allow mixed-batch sampling in dynamic inference (NVIDIA#1927) * Stop Nemo_CICD_Test from failing in forks (NVIDIA#2024) * Clean up dynamic inference step (NVIDIA#1992) Co-authored-by: Lawrence McAfee <[email protected]> Co-authored-by: oliver könig <[email protected]> * ci: Auto-update copy-pr-bot vetters (NVIDIA#1850) Signed-off-by: oliver könig <[email protected]> Co-authored-by: AJ Schmidt <[email protected]> * Have datasets account for tokenizers which incorrectly define PAD (NVIDIA#2017) * ci: Enable integration tests (NVIDIA#2023) Signed-off-by: oliver könig <[email protected]> * ci: Fix build-push-wheel workflow (NVIDIA#2022) Signed-off-by: oliver könig <[email protected]> * chore: Update tooling for interactive jobs (NVIDIA#2032) Signed-off-by: oliver könig <[email protected]> * revert(hotfix): ci: trustees_override (NVIDIA#2041) Signed-off-by: oliver könig <[email protected]> * add missing warnings import in model parallel config (NVIDIA#2039) Signed-off-by: ykarnati <[email protected]> * Reduce-scatter implementation with FP32 accumulation (NVIDIA#1967) 
Signed-off-by: Deepak Narayanan <[email protected]> * ci(fix): Workflows on `main` (NVIDIA#2045) Signed-off-by: oliver könig <[email protected]> * build: Bump modelopt (NVIDIA#2046) Signed-off-by: oliver könig <[email protected]> * Remove TestCaptureFreezeGC unit test. (NVIDIA#1978) * ci: Add multi-approval action (NVIDIA#2051) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Repair codeowners file * ci(hotfix): Set docs allowed to fail Signed-off-by: oliver könig <[email protected]> * Ko3n1g/ci/test iteration time (NVIDIA#2067) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Remove performance for ckpt-resume Signed-off-by: oliver könig <[email protected]> * Allow inference test throughput to vary by 10% (NVIDIA#2070) * ci(hotfix): Inference test pipeline Signed-off-by: oliver könig <[email protected]> * chore: Fix autoformatter (NVIDIA#2073) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Remove iteration-time from t5 Signed-off-by: oliver könig <[email protected]> * ci(hotfix): disable inference test Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Disable inference test Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Bypass approvalbot in merge-queue (NVIDIA#2082) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Enable merge-group for approval bot Signed-off-by: oliver könig <[email protected]> * chore: Update local tooling (NVIDIA#2066) Signed-off-by: oliver könig <[email protected]> * Add extra RL files (NVIDIA#2077) Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: oliver könig <[email protected]> * Prevent summary jobs from running in forks (NVIDIA#2083) Co-authored-by: oliver könig <[email protected]> * ci: Fix test scope (NVIDIA#2091) Signed-off-by: oliver könig <[email protected]> * ci(hotfix): Remove publish workflows Signed-off-by: oliver könig <[email protected]> * Refactor the attention metadata into separate classes (NVIDIA#2001) Co-authored-by: Siddharth 
Singh <[email protected]> Co-authored-by: oliver könig <[email protected]> * Guard against incorrectly using MoE prefill graphs (NVIDIA#2030) Co-authored-by: oliver könig <[email protected]> * Revert "Refactor the attention metadata into separate classes (NVIDIA#2001)" This reverts commit a652e2c. * Run mr-slim tests in lightweight-mode (NVIDIA#2106) Signed-off-by: Charlie Truong <[email protected]> * Inference | Lazy compile UVM allocator. (NVIDIA#1977) Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * chore: Reenable trustees (NVIDIA#2108) Signed-off-by: oliver könig <[email protected]> * Revert "Inference | Lazy compile UVM allocator. (NVIDIA#1977)" This reverts commit 7487c53. * ci(fix): Changeset of copyright checker (NVIDIA#2110) Signed-off-by: oliver könig <[email protected]> * Ko3n1g/chore/update release settings (NVIDIA#2097) Signed-off-by: oliver könig <[email protected]> * Remove unnecessary check on rotary_pos_cos (NVIDIA#2003) Signed-off-by: Keshav Santhanam <[email protected]> * (Reverted) Inference | Lazy compile UVM allocator. 
(NVIDIA#2125) Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: oliver könig <[email protected]> * Refactor Attention Metadata to Separate Classes (NVIDIA#2112) Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: oliver könig <[email protected]> * Refactor model_provider to model_builder format for ModelOpt examples (NVIDIA#2107) * wandb Inference stats logging (NVIDIA#2026) Co-authored-by: root <[email protected]> Co-authored-by: William Dykas <[email protected]> Co-authored-by: root <[email protected]> * Make `PipelineParallelLayout` always return str from ` __repr__` (NVIDIA#2055) Signed-off-by: Ananth Subramaniam <[email protected]> Co-authored-by: oliver könig <[email protected]> * Add flash_attn_3 as first option for FA3 import (NVIDIA#2010) Signed-off-by: Keshav Santhanam <[email protected]> Co-authored-by: oliver könig <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Add debugging hint for case when cudagraphs are created but no matching runner is found (NVIDIA#2129) * ci: LTS container (NVIDIA#2133) Signed-off-by: oliver könig <[email protected]> * Revert "ci: LTS container (NVIDIA#2133)" This reverts commit eb48e81. 
* Fix param init (NVIDIA#2033) Signed-off-by: Chen Cui <[email protected]> * Hotfix to unit tests on hopper FA3 (NVIDIA#2143) * Add BytesIO to safe_globals (NVIDIA#2074) * add deprecation warning for legacy tokenizer system (NVIDIA#2145) Signed-off-by: dimapihtar <[email protected]> * replay: ci: Bump LTS container (NVIDIA#2157) Signed-off-by: oliver könig <[email protected]> * Hotfix to unit tests on hopper FA3 (bis) (NVIDIA#2179) * Fix has_modelopt_state() for native Torch checkpoint format (NVIDIA#2160) Signed-off-by: Asha Anoosheh <[email protected]> * chore: Remove codeowners (NVIDIA#2175) Signed-off-by: oliver könig <[email protected]> * Fix FP8 inference with sequence parallelism (NVIDIA#2009) Signed-off-by: Keshav Santhanam <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Replace ModelOpt generation server (NVIDIA#2147) Signed-off-by: Asha Anoosheh <[email protected]> * Add hybrid model support for dynamic inference engine (NVIDIA#1907) Signed-off-by: Keshav Santhanam <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Async task and event loop safety in Megatron Core (NVIDIA#2025) Co-authored-by: Robert Kirby <[email protected]> * Rename skip_prompt_log_probs (NVIDIA#2181) * Dynamic inference context | UVM only. (NVIDIA#1983) Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> * Update copy-pr-bot.yaml [skip ci] Signed-off-by: oliver könig <[email protected]> * Revert "Dynamic inference context | UVM only. (NVIDIA#1983)" This reverts commit d6979d6. 
* ci: Run `auto-update-copy-pr-bot` only on forks (NVIDIA#2191) Signed-off-by: oliver könig <[email protected]> * Inference throughput tests: refactor goldens to be in list format (NVIDIA#2072) * Enable TE custom quantization recipe (NVIDIA#2005) Signed-off-by: Evgeny <[email protected]> Signed-off-by: root <Evgeny> Co-authored-by: oliver könig <[email protected]> Co-authored-by: root <Evgeny> * Add MoE parameters to ModelOpt pruning example + conf fixes (NVIDIA#2205) Signed-off-by: Keval Morabia <[email protected]> * Add repr to pg collection class (NVIDIA#2089) Co-authored-by: Jared Casper <[email protected]> * Move `data_samplers.py` from `legacy` to `training.datasets` & add `DistributedSignalHandler` to DataLoader workers (NVIDIA#2068) * Fix Megatron-FSDP checkpoint save failure (NVIDIA#2138) --------- Signed-off-by: oliver könig <[email protected]> Signed-off-by: dimapihtar <[email protected]> Signed-off-by: Youngeun Kwon <[email protected]> Signed-off-by: Youngeun <[email protected]> Signed-off-by: Maanu Grover <[email protected]> Signed-off-by: ykarnati <[email protected]> Signed-off-by: Deepak Narayanan <[email protected]> Signed-off-by: Charlie Truong <[email protected]> Signed-off-by: Keshav Santhanam <[email protected]> Signed-off-by: Ananth Subramaniam <[email protected]> Signed-off-by: Chen Cui <[email protected]> Signed-off-by: Asha Anoosheh <[email protected]> Signed-off-by: Evgeny <[email protected]> Signed-off-by: root <Evgeny> Signed-off-by: Keval Morabia <[email protected]> Co-authored-by: oliver könig <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> Co-authored-by: Dmytro Pykhtar <[email protected]> Co-authored-by: Youngeun Kwon <[email protected]> Co-authored-by: Lawrence McAfee <[email protected]> Co-authored-by: Mcore Bot <[email protected]> Co-authored-by: Maanu Grover <[email protected]> Co-authored-by: Lawrence McAfee <[email protected]> Co-authored-by: AJ Schmidt <[email protected]> Co-authored-by: Yashaswi 
Karnati <[email protected]> Co-authored-by: Deepak Narayanan <[email protected]> Co-authored-by: helen ngo <[email protected]> Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: kanz-nv <[email protected]> Co-authored-by: Siddharth Singh <[email protected]> Co-authored-by: Charlie Truong <[email protected]> Co-authored-by: Keshav Santhanam <[email protected]> Co-authored-by: Asha Anoosheh <[email protected]> Co-authored-by: wdykas <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: William Dykas <[email protected]> Co-authored-by: root <[email protected]> Co-authored-by: Ananth Subramaniam <[email protected]> Co-authored-by: Chen Cui <[email protected]> Co-authored-by: Teodor-Dumitru Ene <[email protected]> Co-authored-by: Robert Kirby <[email protected]> Co-authored-by: Evgeny Tsykunov <[email protected]> Co-authored-by: Keval Morabia <[email protected]> Co-authored-by: Jared Casper <[email protected]> Co-authored-by: Antoni-Joan Solergibert <[email protected]>
What does this PR do?
Add extra RL files
Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
Pre-checks
Core 0.8)
Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into `main` branch
(Step 1): Add PR label `Expert Review`
(Step 2): Collect the expert reviewers' reviews
Add the `Expert Review` label when your PR is ready for review.
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Add the `Final Review` label.
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion.
MRs are mergeable after one approval by either [email protected] or [email protected].
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.